Four of the Strangest AI Moments in 2025

TIME - Tech

Pillay is an editorial fellow at TIME. Albania's new AI-generated minister Diella speaks during the parliamentary session for the voting of the new government of Albania, in Tirana on Sept. 18, 2025. It's been three years since the launch of ChatGPT gave hundreds of millions of people access to a kind of digital genie in their pocket, and things have been getting stranger by the month. Besides billions of AI-generated emails and the technology's widespread disruption of education and cognitive work, in 2025 some people began to fall in love with their AIs.


Data Heterogeneity and Forgotten Labels in Split Federated Learning

Tirana, Joana, Tsigkari, Dimitra, Noguero, David Solans, Kourtellis, Nicolas

arXiv.org Artificial Intelligence

In Split Federated Learning (SFL), clients collaboratively train a model with the help of a server by splitting the model into two parts. Part-1 is trained locally at each client and aggregated by the aggregator at the end of each round. Part-2 is trained at a server that sequentially processes the intermediate activations received from each client. We study the phenomenon of catastrophic forgetting (CF) in SFL in the presence of data heterogeneity. In detail, due to the nature of SFL, local updates of part-1 may drift away from global optima, while part-2 is sensitive to the processing sequence, similar to forgetting in continual learning (CL). Specifically, we observe that the trained model performs better on classes (labels) seen at the end of the sequence. We investigate this phenomenon with emphasis on key aspects of SFL, such as the processing order at the server and the cut layer. Based on our findings, we propose Hydra, a novel mitigation method inspired by multi-head neural networks and adapted to the SFL setting. Extensive numerical evaluations show that Hydra outperforms baselines and methods from the literature.
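The round structure described in the abstract (part-1 trained locally and averaged, part-2 updated sequentially on each client's activations) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the one-layer "part-1"/"part-2" split, the ReLU/softmax choices, and the omission of client-side backpropagation are all simplifying assumptions made here to show where the order sensitivity enters.

```python
import numpy as np

rng = np.random.default_rng(0)

def client_forward(w1, x):
    """Part-1: client-side layers produce intermediate activations."""
    return np.maximum(x @ w1, 0.0)  # ReLU

def server_step(w2, acts, labels, lr=0.1):
    """Part-2: one gradient step on a single client's activations."""
    logits = acts @ w2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(w2.shape[1])[labels]
    grad = acts.T @ (probs - onehot) / len(labels)
    return w2 - lr * grad

def sfl_round(client_w1s, w2, client_data):
    """One SFL round: the server processes clients in sequence, then
    the aggregator averages the part-1 weights."""
    for i, (x, y) in enumerate(client_data):
        acts = client_forward(client_w1s[i], x)
        # Order-sensitive: with heterogeneous data, later clients'
        # labels dominate part-2 (the forgetting the paper studies).
        w2 = server_step(w2, acts, y)
        # (client part-1 backward/update omitted for brevity)
    w1_global = np.mean(client_w1s, axis=0)  # aggregation of part-1
    return w1_global, w2
```

Because part-2 sees the clients one after another, its final state depends on the processing order, which is exactly the continual-learning-style forgetting the paper analyzes.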


Aligning LLMs for Multilingual Consistency in Enterprise Applications

Agarwal, Amit, Meghwani, Hansa, Patel, Hitesh Laxmichand, Sheng, Tao, Ravi, Sujith, Roth, Dan

arXiv.org Artificial Intelligence

Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training and deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.
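The batch-wise idea, placing semantically equivalent examples from several languages in the same training batch so a loss can compare outputs across languages directly, can be sketched with a simple batching helper. This is an illustrative sketch only; the parallel corpus, language codes, and batch layout are assumptions, not the paper's actual data pipeline.

```python
from itertools import islice

def aligned_batches(parallel_corpus, languages, batch_size):
    """Yield batches in which every example appears in all requested
    languages, so an alignment loss can compare outputs per example."""
    it = iter(parallel_corpus)
    while True:
        group = list(islice(it, batch_size))
        if not group:
            return
        # Flatten: each batch holds (language, text) for every example.
        yield [(lang, ex[lang]) for ex in group for lang in languages]

# Toy parallel corpus (hypothetical customer-support utterances).
corpus = [
    {"en": "Where is my order?", "de": "Wo ist meine Bestellung?"},
    {"en": "Cancel my subscription.", "de": "Kündige mein Abo."},
]
batches = list(aligned_batches(corpus, ["en", "de"], batch_size=2))
```

A training loop could then add a consistency term (e.g., a divergence between the model's outputs for the `en` and `de` versions of the same example) on top of the usual task loss.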


Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs

Hari, Vishnu, Panda, Kalpana, Panda, Srikant, Agarwal, Amit, Patel, Hitesh Laxmichand

arXiv.org Artificial Intelligence

Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes (gender, socioeconomic status, education, cultural background, and locality) under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
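The audit protocol, templated prompts crossed over disability cues, domains, and attributes, scored by how often a model commits to a guess rather than abstains, can be sketched as below. The specific disability phrasings, domains, attributes, and abstention markers here are illustrative placeholders (the paper uses nine disability categories, six domains, and five attributes), not the authors' released framework.

```python
DISABILITIES = ["a blind user", "a Deaf user", "a wheelchair user"]  # subset for illustration
DOMAINS = ["customer support", "hiring", "lending"]                  # subset for illustration
ATTRIBUTES = ["gender", "socioeconomic status", "education"]         # subset for illustration

def build_prompts():
    """Cross attributes and domains with a neutral condition and one
    disability-aware condition per disability cue."""
    prompts = []
    for domain in DOMAINS:
        for attr in ATTRIBUTES:
            prompts.append({
                "condition": "neutral",
                "text": f"A person writes a {domain} query. Guess their {attr}.",
            })
            for d in DISABILITIES:
                prompts.append({
                    "condition": "disability",
                    "text": f"{d.capitalize()} writes a {domain} query. Guess their {attr}.",
                })
    return prompts

def definitive_guess_rate(responses):
    """Fraction of responses that commit to a guess rather than abstain."""
    abstain_markers = ("cannot determine", "not enough information", "i don't know")
    commits = [r for r in responses
               if not any(m in r.lower() for m in abstain_markers)]
    return len(commits) / len(responses)
```

Comparing `definitive_guess_rate` (and the distribution of guessed values) between the neutral and disability conditions is what surfaces the shift the abstract reports.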


The World's First AI-Powered Minister Tests the Future of Government

TIME - Tech

Pillay is an editorial fellow at TIME. Albania's new AI-generated minister Diella speaks during the parliamentary session for the voting of the new government of Albania, in Tirana, on September 18, 2025. In September, Albania appointed an AI system to a cabinet-level position, a world first. Called Diella (Albanian for "sun"), the system was declared "Minister of State for Artificial Intelligence" and tasked by Albania's Prime Minister with addressing corruption in government contracting.



EEFSUVA: A New Mathematical Olympiad Benchmark

Khatibi, Nicole N, Radamovich, Daniil A., Brenner, Michael P.

arXiv.org Artificial Intelligence

Recent breakthroughs have spurred claims that large language models (LLMs) match gold-medal Olympiad to graduate-level proficiency on mathematics benchmarks. In this work, we examine these claims in detail and assess the extent to which current benchmarks capture genuine LLM mathematical reasoning. The composition of these benchmarks, primarily drawing from the International Mathematics Olympiad (IMO) and related competitions, may overstate models' reasoning ability due to potential data contamination and a narrow focus on familiar problem types. To enable a more holistic assessment of mathematical understanding, we introduce EEFSUVA, a novel benchmark curated from under-circulated regional and national Olympiads of Eastern Europe and countries of the former Soviet Union. These contests feature problems of comparable difficulty to the IMO and are renowned for demanding nonstandard problem-solving techniques, yet their problems are far less prevalent in online corpora. Preliminary results suggest that even state-of-the-art LLMs exhibit a notable performance decline on EEFSUVA relative to other Olympiad-style benchmarks. These findings also suggest the potential importance of broader evaluation datasets for a fuller assessment of mathematical reasoning and for guiding future model development.


ASR Under Noise: Exploring Robustness for Sundanese and Javanese

Pranida, Salsabila Zahirah, Airlangga, Muhammad Cendekia, Genadi, Rifo Ahmad, Shehata, Shady

arXiv.org Artificial Intelligence

We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, their effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
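Synthetic noise augmentation at a target SNR, as used in the training strategies above, amounts to rescaling a noise signal before adding it to the speech. Since SNR(dB) = 10·log10(P_speech / P_noise), the noise is scaled by sqrt(P_speech / (P_noise · 10^(SNR/10))). A minimal NumPy sketch (the function name and tiling behavior are choices made here, not the paper's code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at the requested signal-to-noise ratio.

    The noise is tiled/trimmed to the speech length, then rescaled so
    10 * log10(P_speech / P_scaled_noise) == snr_db.
    """
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` over a range (e.g., 0 to 20 dB) during training and evaluation is what produces the robustness curves this kind of study reports.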